Abstract

We study the properties of the Minimum Description Length principle for 
sequence prediction, considering a two-part MDL estimator which is chosen 
from a countable class of models. This applies in particular to the important 
case of universal sequence prediction, where the model class corresponds to 
all algorithms for some fixed universal Turing machine (this correspondence 
is by enumerable semimeasures, hence the resulting models are stochastic). 
We prove convergence theorems similar to Solomonoff's theorem of universal
induction, which also holds for general Bayes mixtures. The bound charac- 
terizing the convergence speed for MDL predictions is exponentially larger as 
compared to Bayes mixtures. We observe that there are at least three different 
ways of using MDL for prediction. One of these has worse prediction prop- 
erties, for which predictions only converge if the MDL estimator stabilizes. 
We establish sufficient conditions for this to occur. Finally, some immediate 
consequences for complexity relations and randomness criteria are proven. 
1 Introduction



The Minimum Description Length (MDL) principle is one of the most important
concepts in Machine Learning, and serves as a scientific guide in general. In particular,
the process of building a model for any kind of given data is governed by the
MDL principle in the majority of cases. The following illustrating example is probably
familiar to many readers: A Bayesian net (or neural network) is constructed
from (trained with) some data. We may just determine (train) the net in order to 
fit the data as closely as possible, then we are describing the data very precisely, but 
disregard the description of the net itself. The resulting net is a maximum likelihood 
estimator. Alternatively, we may simultaneously minimize the "residual" descrip- 
tion length of the data given the net and the description length of the net. This 
corresponds to minimizing a regularized error term, and the result is a maximum a 
posteriori or MDL estimator. The latter way of modelling is not only superior to the 
former in most applications, it is also conceptually appealing since it implements 
the simplicity principle, Occam's razor. 

The MDL method has been studied on all possible levels from very concrete
and highly tuned practical applications up to general theoretical assertions (see e.g.
[WB68, Ris78, Grü98]). The aim of this work is to contribute to the theory of MDL.
We regard Bayesian or neural nets or other models as just some particular class of
models. We identify (probabilistic) models with (semi)measures, data with the
initial part of a sequence $x_1, x_2, \ldots, x_{t-1}$, and the task of learning with the problem
of predicting the next symbol $x_t$ (or more symbols). The sequence $x_1, x_2, \ldots$ itself is
generated by some true but unknown distribution $\mu$.

A two-part MDL estimator for some string $x = x_1, \ldots, x_{t-1}$ is then some short
description of the semimeasure, while simultaneously the probability of the data
under the related semimeasure is large. Surprisingly little work has been done on this
general setting of sequence prediction with MDL. In contrast, most work addresses
MDL for coding and modeling, or others, see e.g. [BRY98, Ris96, BC91, Ris99].
Moreover, there are some results for the prediction of independently identically
distributed (i.i.d.) sequences, see e.g. [BC91]. There, discrete model classes are
considered, while most of the material available focusses on continuous model classes.
In our work we will study countable classes of arbitrary semimeasures.

There is a strong motivation for considering both countable classes and semimea- 
sures: In order to derive performance guarantees one has to assume that the model 
class contains the true model. So the larger we choose this class, the less restrictive 
is this assumption. From a computational point of view the largest relevant class
is the class of all lower-semicomputable semimeasures. We call this setup univer- 
sal sequence prediction. This class is at the foundations of and has been intensely 
studied in Algorithmic Information Theory [ZL70, LV97, Cal02]. Since algorithms 
do not necessarily halt on each string, one is forced to consider the more general 
class of semimeasures, rather than measures. Solomonoff [Sol64, Sol78] defined a 
universal induction system, essentially based on a Bayes mixture over this class (see 
[Hut01b, Hut03a] for recent developments). There seems to be no work on MDL for
this class, which this paper intends to change. What has been studied intensely in
[Hut03b] is the so-called one-part MDL over the class of deterministic computable
models (see also Section 7).

The paper is structured as follows. Section 2 establishes basic definitions. In
Section 3, we introduce the MDL estimator and show how it can be used for sequence 
prediction in at least three ways. Sections 4 and 5 are devoted to convergence 
theorems. In Section 6, we study the stabilization properties of the MDL estimator. 
The setting of universal sequence prediction is treated in Section 7. Finally, Section 
8 contains the conclusions. 

2 Prerequisites and Notation 

We build on the notation of [LV97] and [Hut03b]. Let the alphabet $\mathcal X$ be a finite
set of symbols. We consider the spaces $\mathcal X^*$ and $\mathcal X^\infty$ of finite strings and infinite
sequences over $\mathcal X$. The initial part of a sequence up to a time $t \in \mathbb N$ or $t-1 \in \mathbb N$ is
denoted by $x_{1:t}$ or $x_{<t}$, respectively. The empty string is denoted by $\epsilon$.
A semimeasure is a function $\nu : \mathcal X^* \to [0,1]$ such that

$$\nu(\epsilon) \le 1 \quad\text{and}\quad \nu(x) \ge \sum_{a\in\mathcal X} \nu(xa) \quad\text{for all } x \in \mathcal X^* \qquad (1)$$

holds. If equality holds in both inequalities of (1), then we have a measure. Let $\mathcal C$
be a countable class of (semi)measures, i.e. $\mathcal C = \{\nu_i : i \in I\}$ with finite or infinite
index set $I \subseteq \mathbb N$. A (semi)measure $\tilde\nu$ dominates the class $\mathcal C$ iff for all $\nu_i \in \mathcal C$ there is a
constant $c(\nu_i) > 0$ such that $\tilde\nu(x) \ge c(\nu_i) \cdot \nu_i(x)$ holds for all $x \in \mathcal X^*$. The dominant
semimeasure $\tilde\nu$ need not be contained in $\mathcal C$, but if it is, we call it a universal element
of $\mathcal C$.

Let $\mathcal C$ be a countable class of (semi)measures, where each $\nu \in \mathcal C$ is associated
with a weight $w_\nu > 0$ and $\sum_\nu w_\nu \le 1$. We may interpret the weights as a prior on
$\mathcal C$. Then it is obvious that the Bayes mixture

$$\xi(x) = \xi_{[\mathcal C]}(x) = \sum_{\nu\in\mathcal C} w_\nu \nu(x), \quad x \in \mathcal X^* \qquad (2)$$

dominates $\mathcal C$. Assume that there is some measure $\mu \in \mathcal C$, the true distribution,
generating sequences $x_{<\infty} \in \mathcal X^\infty$. Normally $\mu$ is unknown. (Note that we require
$\mu$ to be a measure, while $\mathcal C$ may contain also semimeasures in general. This is
motivated by the setting of universal sequence prediction as already indicated.) If
some initial part $x_{<t}$ of a sequence is given, the probability of observing $x_t \in \mathcal X$ as
a next symbol is given by

$$\mu(x_t|x_{<t}) = \frac{\mu(x_{<t}x_t)}{\mu(x_{<t})} \ \text{ if } \mu(x_{<t}) > 0 \quad\text{and}\quad \mu(x_t|x_{<t}) = 0 \ \text{ if } \mu(x_{<t}) = 0. \qquad (3)$$

The case $\mu(x_{<t}) = 0$ is stated only for well-definedness, it has probability zero. Note
that $\mu(x_t|x_{<t})$ may depend on $x_{<t}$. We may generally define the quantity (3) for any
function $\varphi : \mathcal X^* \to [0,1]$, we call $\varphi(x_t|x_{<t}) = \frac{\varphi(x_{1:t})}{\varphi(x_{<t})}$ the $\varphi$-prediction. Clearly, this is
not necessarily a probability on $\mathcal X$ for general $\varphi$. For a semimeasure $\nu$ in particular,
the $\nu$-prediction $\nu(\cdot|x_{<t})$ is a semimeasure on $\mathcal X$.
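
To make the mixture (2) and the prediction rule (3) concrete, the following minimal Python sketch may help. It assumes a small, purely illustrative class of three Bernoulli measures on the binary alphabet with uniform weights; the class, the weights and the helper names (`bernoulli`, `xi`, `predict`) are assumptions made for this illustration and do not come from the paper.

```python
from fractions import Fraction

def bernoulli(theta):
    """Return the i.i.d. measure nu(x) = prod_t P(x_t), with P('1') = theta."""
    def nu(x):
        p = Fraction(1)
        for bit in x:
            p *= theta if bit == "1" else 1 - theta
        return p
    return nu

# Hypothetical three-element class with prior weights summing to 1.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
weights = [Fraction(1, 3)] * 3
models = [bernoulli(th) for th in thetas]

def xi(x):
    """Bayes mixture (2): xi(x) = sum over nu of w_nu * nu(x)."""
    return sum(w * nu(x) for w, nu in zip(weights, models))

def predict(phi, a, x):
    """phi-prediction (3): phi(xa) / phi(x), set to 0 when phi(x) = 0."""
    denom = phi(x)
    return phi(x + a) / denom if denom > 0 else Fraction(0)

x = "1101"
p1 = predict(xi, "1", x)  # mixture probability that the next symbol is 1
p0 = predict(xi, "0", x)  # mixture probability that the next symbol is 0
```

Since every element of this toy class is a measure, the two mixture predictions sum to one, and $\xi(x) \ge w_\nu \nu(x)$ holds for each $\nu$, which is exactly the dominance property.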

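We define the expectation with respect to the true probability $\mu$: Let $n \ge 0$ and
$f : \mathcal X^n \to \mathbb R$ be a function, then

$$\mathbf E f = \mathbf E f(x_{1:n}) = \sum_{x_{1:n}\in\mathcal X^n} \mu(x_{1:n}) f(x_{1:n}). \qquad (4)$$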

Generally, we may also define the expectation as an integral over infinite sequences. 
But since we won't need it, we can keep things simple. We can now state a central 
result about prediction with Bayes mixtures in a form independent of Algorithmic 
Information Theory. 

2.1 Theorem. For any class of (semi)measures $\mathcal C$ containing the true distribution
$\mu$ and any $n \ge 1$, we have

$$\sum_{t=1}^{n} \mathbf E \sum_{a\in\mathcal X} \big(\mu(a|x_{<t}) - \xi(a|x_{<t})\big)^2 \le \ln w_\mu^{-1}. \qquad (5)$$

This was found by Solomonoff ([Sol78]) for universal sequence prediction. A proof
is also given in [LV97] (only for binary alphabet) or [Hut01a] (arbitrary alphabet).
It is surprisingly simple once Lemma 4.2 is known. A few lines analogous to (8) and
(9) exploiting the dominance of $\xi$ are sufficient.

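The bound (5) asserts convergence of the $\xi$-predictions to the $\mu$-predictions in
mean sum (i.m.s.), since we define

$$\varphi \xrightarrow{\text{i.m.s.}} \mu \iff \exists\, C > 0:\ \sum_{t=1}^{\infty} \mathbf E \sum_{a\in\mathcal X} \big(\mu(a|x_{<t}) - \varphi(a|x_{<t})\big)^2 \le C. \qquad (6)$$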

Convergence i.m.s. implies convergence with $\mu$-probability one (w.$\mu$-p.1), since
otherwise the sum would be infinite. Moreover, convergence i.m.s. provides a rate or
speed of convergence in the sense that the expected number of times $t$ in which
$\varphi(a|x_{<t})$ deviates more than $\varepsilon$ from $\mu(a|x_{<t})$ is finite and bounded by $C/\varepsilon^2$ and the
probability that the number of $\varepsilon$-deviations exceeds $\frac{C}{\varepsilon^2\delta}$ is smaller than $\delta$. If the
quadratic differences were monotonically decreasing (which is usually not the case),
we could even conclude convergence faster than $\frac1t$.
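
The two counting statements follow from (6) by a short Markov-type argument. The derivation below is a sketch under the assumption that an $\varepsilon$-deviation at time $t$ means $|\mu(a|x_{<t}) - \varphi(a|x_{<t})| > \varepsilon$ for at least one symbol $a$; the abbreviations $s_t$ and $D_\varepsilon$ are introduced here only for this purpose.

```latex
% s_t: instantaneous squared error, D_eps: number of eps-deviations (both hypothetical names)
s_t := \sum_{a\in\mathcal X}\big(\mu(a|x_{<t})-\varphi(a|x_{<t})\big)^2,
\qquad
D_\varepsilon := \#\{t : \exists a,\ |\mu(a|x_{<t})-\varphi(a|x_{<t})|>\varepsilon\}.

% An eps-deviation at time t forces s_t > eps^2, hence the indicator is at most s_t / eps^2:
D_\varepsilon \;\le\; \sum_{t=1}^{\infty}\frac{s_t}{\varepsilon^2}
\quad\Longrightarrow\quad
\mathbf E\, D_\varepsilon \;\le\; \frac{1}{\varepsilon^2}\sum_{t=1}^{\infty}\mathbf E\, s_t \;\le\; \frac{C}{\varepsilon^2}.

% Markov's inequality applied to the nonnegative random variable D_eps:
P\Big[D_\varepsilon \ge \frac{C}{\varepsilon^2\delta}\Big]
\;\le\; \frac{\mathbf E\, D_\varepsilon}{C/(\varepsilon^2\delta)} \;\le\; \delta .
```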

2.2 Probabilities vs. Description Lengths. By the Kraft inequality, each
(semi)measure can be associated with a code length or complexity by means of the
negative logarithm, where all (binary) codewords form a prefix-free set. The converse
holds as well. E.g. for the weights $w_\nu$ with $\sum_\nu w_\nu \le 1$, codes of lengths $\lceil -\log_2 w_\nu \rceil$
can be found. It is often only a matter of notational convenience if description
lengths or probabilities are used, but description lengths are generally preferred in
Algorithmic Information Theory. Keeping the equivalence in mind, we will develop
the general theory in terms of probabilities, but formulate parts of the results in
universal sequence prediction rather in terms of complexities.
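
As a small illustration of this correspondence, the following Python sketch turns a hypothetical set of weights into code lengths $\lceil -\log_2 w_\nu \rceil$, checks the Kraft sum, and assigns prefix-free codewords with the standard canonical construction. The weights, the model names and the helper `kraft_code` are assumptions made for this example only.

```python
import math

# Hypothetical prior weights with total mass at most 1.
weights = {"nu1": 0.4, "nu2": 0.3, "nu3": 0.2}

# Code length ceil(-log2 w) for each model.
code_lengths = {name: math.ceil(-math.log2(w)) for name, w in weights.items()}

# Kraft sum: it is at most sum(w) <= 1, so a prefix-free code with these lengths exists.
kraft_sum = sum(2.0 ** -length for length in code_lengths.values())

def kraft_code(lengths):
    """Assign prefix-free binary codewords to the given lengths (canonical code).

    Requires positive lengths whose Kraft sum is at most 1.
    """
    codewords = {}
    value, prev_len = 0, 0
    for name, length in sorted(lengths.items(), key=lambda kv: kv[1]):
        value <<= length - prev_len          # pad the counter to the new length
        codewords[name] = format(value, "0{}b".format(length))
        value += 1                           # next codeword at this length
        prev_len = length
    return codewords

codewords = kraft_code(code_lengths)
```

Rounding up can only shorten the total Kraft mass, which is why the construction always succeeds when the weights sum to at most one.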



3 MDL Estimator and Predictions 

Assume that $\mathcal C$ is a countable class of semimeasures together with weights $(w_\nu)_{\nu\in\mathcal C}$,
and $x \in \mathcal X^*$ is some string. Then the maximizing element $\nu^x$, often called MAP
estimator, is defined as

$$\nu^x = \nu^x_{[\mathcal C]} = \arg\max_{\nu\in\mathcal C} \{w_\nu \nu(x)\}.$$

In fact the maximum is attained since for each $\varepsilon \in (0,1)$ only a finite number
of elements fulfil $w_\nu \nu(x) > \varepsilon$. Observe immediately the correspondence in terms
of description lengths rather than probabilities: $\nu^x = \arg\min_{\nu\in\mathcal C}\{-\log_2 w_\nu - \log_2 \nu(x)\}$.
Then the minimum description length principle is obvious: $\nu^x$ minimizes
the joint description length of the model plus the data given the model¹ (see the
last paragraph of the previous section). As explained before, we stick to the product
notation.

For notational simplicity we set $\nu^*(x) = \nu^x(x)$. The two-part MDL estimator is
defined by

$$\varrho(x) = \varrho_{[\mathcal C]}(x) = w_{\nu^x} \nu^x(x) = \max_{\nu\in\mathcal C} \{w_\nu \nu(x)\}.$$

So $\varrho$ chooses the maximizing element with respect to its argument. We may also
use the version $\varrho^y(x) := w_{\nu^y} \nu^y(x)$ for which the choice depends on the superscript
instead of the argument. For each $x, y \in \mathcal X^*$, $\xi(x) \ge \varrho(x) \ge \varrho^y(x)$ is immediate.
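
The definitions of the maximizing element and of the two-part estimator can be sketched in a few lines of Python. The three Bernoulli models, their weights and the function names (`map_index`, `rho`, `xi`) are hypothetical choices for illustration; ties are broken in favour of the lowest index, which is an assumption of this sketch and not prescribed by the text.

```python
from fractions import Fraction

def bernoulli(theta):
    """I.i.d. measure on binary strings with P('1') = theta."""
    def nu(x):
        p = Fraction(1)
        for bit in x:
            p *= theta if bit == "1" else 1 - theta
        return p
    return nu

# Hypothetical class and prior.
thetas = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
weights = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
models = [bernoulli(th) for th in thetas]

def map_index(x):
    """Index of the maximizing element: argmax over nu of w_nu * nu(x)."""
    scores = [w * nu(x) for w, nu in zip(weights, models)]
    return scores.index(max(scores))  # lowest index wins a tie

def rho(x, y=None):
    """rho^y(x) = w_{nu^y} * nu^y(x); with y omitted this is rho(x) = rho^x(x)."""
    i = map_index(x if y is None else y)
    return weights[i] * models[i](x)

def xi(x):
    """Bayes mixture over the same class, for comparison."""
    return sum(w * nu(x) for w, nu in zip(weights, models))
```

For instance, on `"1111"` the model with parameter 9/10 is selected, whereas the estimator chosen for the empty string is the model with the largest weight, so `xi("1111") >= rho("1111") >= rho("1111", y="")` reproduces the chain $\xi(x) \ge \varrho(x) \ge \varrho^y(x)$.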

We can define MDL predictors according to (3). There are at least three possible
ways to use MDL for prediction.

3.1 Definition. The dynamic MDL predictor is defined as

$$\varrho(a|x) = \frac{\varrho(xa)}{\varrho(x)} = \frac{\varrho^{xa}(xa)}{\varrho^{x}(x)}.$$

That is, we look for a short description of $xa$ and relate it to a short description of
$x = x_{<t}$. We call this dynamic since for each possible $a$ we have to find a new MDL
estimator. This is the closest correspondence to the $\xi$-predictor.

¹ Precisely, we define a MAP (maximum a posteriori) estimator. For two reasons, information
theorists and statisticians would not consider our definition as MDL in the strong sense. First,
MDL is often associated with a specific prior. Second, when coding some data $x$, one can exploit
the fact that once the model $\nu^x$ is specified, only data which leads to the maximizing element $\nu^x$
needs to be considered. This allows for a description shorter than $\log_2 \nu^x(x)$. Since however most
authors refer to MDL, we will keep using this general term instead of MAP, too.
3.2 Definition. The static MDL predictor is given by

$$\varrho^{static}(a|x) = \varrho^{x}(a|x) = \frac{\varrho^{x}(xa)}{\varrho^{x}(x)} = \frac{\nu^{x}(xa)}{\nu^{x}(x)}.$$

Here obviously only one MDL estimator has to be identified, which may be more
efficient in practice.

3.3 Definition. The hybrid MDL predictor is given by $\varrho^{hybrid}(a|x) = \frac{\nu^{xa}(xa)}{\nu^{x}(x)}$. This
can be paraphrased as "do dynamic MDL and drop the weights". It is somewhat
in-between static and dynamic MDL.

The range of the static MDL predictor is obviously contained in [0, 1]. For the
dynamic MDL predictor, this holds by $\varrho^{x}(x) \ge \varrho^{xa}(x) \ge \varrho^{xa}(xa)$, while for the
hybrid MDL predictor it is generally false.
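
The difference between the three predictors, and the fact that the hybrid variant can leave [0, 1], can be seen on a toy class. The sketch below assumes three hypothetical Bernoulli models with hand-picked weights; all names and numbers are illustrative assumptions.

```python
from fractions import Fraction

def bernoulli(theta):
    """I.i.d. measure on binary strings with P('1') = theta."""
    def nu(x):
        p = Fraction(1)
        for bit in x:
            p *= theta if bit == "1" else 1 - theta
        return p
    return nu

# Hypothetical class and prior.
weights = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
models = [bernoulli(Fraction(1, 10)), bernoulli(Fraction(1, 2)), bernoulli(Fraction(9, 10))]

def map_index(x):
    """Index of the element maximizing w_nu * nu(x) (lowest index on ties)."""
    scores = [w * nu(x) for w, nu in zip(weights, models)]
    return scores.index(max(scores))

def dynamic(a, x):
    """rho(xa) / rho(x): a new estimator is selected for xa."""
    i, j = map_index(x + a), map_index(x)
    return (weights[i] * models[i](x + a)) / (weights[j] * models[j](x))

def static(a, x):
    """nu^x(xa) / nu^x(x): the estimator selected for x is kept."""
    j = map_index(x)
    return models[j](x + a) / models[j](x)

def hybrid(a, x):
    """nu^{xa}(xa) / nu^x(x): dynamic selection with the weights dropped."""
    i, j = map_index(x + a), map_index(x)
    return models[i](x + a) / models[j](x)

d, s, h = dynamic("1", "1"), static("1", "1"), hybrid("1", "1")
```

In this toy setting the estimator switches from the parameter 1/2 (selected for `"1"`) to 9/10 (selected for `"11"`), so the dynamic and static values stay in [0, 1] while the hybrid value exceeds 1.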

Static MDL is omnipresent in machine learning and applications. In fact, many 
common prediction algorithms can be abstractly understood as static MDL, or 
rather as approximations. Namely, if a prediction task is accomplished by building 
a model such as a neural network with a suitable regularization to prevent "overfit- 
ting" , this is just searching an MDL estimator within a certain class of distributions. 
After that, only this model is used for prediction. Dynamic and hybrid MDL are 
applied more rarely due to their larger computational effort. For example, the sim- 
ilarity metric proposed in [LCL+03] can be interpreted as (a deterministic variant 
of) dynamic MDL. For hybrid MDL, we will see that the prediction properties are 
worse than for dynamic and static MDL. 

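We will need to convert our MDL predictors to measures on $\mathcal X$ by means of
normalization. If $\varphi : \mathcal X^* \to [0,1]$ is any function, then

$$\varphi_{norm}(a|x_{<t}) = \frac{\varphi(a|x_{<t})}{\sum_{b\in\mathcal X}\varphi(b|x_{<t})} = \frac{\varphi(x_{<t}a)}{\sum_{b\in\mathcal X}\varphi(x_{<t}b)}$$

(assume that the denominator is different from zero, which is always true with
probability 1 if $\varphi$ is an MDL predictor). This procedure is known as Solomonoff
normalization ([Sol78, LV97]) and results in $\varphi_{norm}(x_{1:n}) = \varphi(x_{1:n}) / [\varphi(\epsilon) N_\varphi(x_{<n})]$,
where

$$N_\varphi(x) = \prod_{t=1}^{\ell(x)+1} \frac{\sum_{a\in\mathcal X}\varphi(x_{<t}a)}{\varphi(x_{<t})} \qquad (7)$$

is the normalizer. Before proceeding with the theory, an example is in order.

3.4 Example. Let $n \in \mathbb N$, $\mathcal X = \{1, \ldots, n\}$, and

$$\mathcal C = \{\nu_\vartheta(x_{1:t}) = \vartheta_{x_1} \cdots \vartheta_{x_t} : \vartheta \in \Theta\} \quad\text{with}\quad \Theta = \Big\{\vartheta \in ([0,1]\cap\mathbb Q)^n : \sum_{i=1}^{n} \vartheta_i = 1\Big\}$$

be the set of all rational probability vectors with any prior $(w_\vartheta)_{\vartheta\in\Theta}$. Each $\vartheta \in \Theta$
generates sequences $x_{<\infty}$ of independently identically distributed (i.i.d.) random
variables such that $P(x_t = i) = \vartheta_i$ for all $t \ge 1$ and $1 \le i \le n$. If $x_{1:t}$ is the initial
part of a sequence and $\alpha \in \Theta$ is defined by $\alpha_i = \frac1t |\{s \le t : x_s = i\}|$, then it is easy to
see that

$$\nu^{x_{1:t}} = \arg\max_{\vartheta\in\Theta} \{w_\vartheta \cdot \exp[-t \cdot D(\alpha\|\vartheta)]\},$$

where $D(\alpha\|\vartheta) = \sum_{i=1}^{n} \alpha_i \ln\frac{\alpha_i}{\vartheta_i}$ is the Kullback-Leibler divergence. If $|\mathcal X| = 2$, then
$\Theta$ is also called a Bernoulli class, and one usually takes the binary alphabet $\mathcal X = \mathbb B = \{0, 1\}$
in this case.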
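
The Kullback-Leibler form of the maximizing element can be checked numerically. The sketch below assumes a finite, hypothetical subset of three parameter vectors over a three-letter alphabet (indexed 0, 1, 2 here instead of 1, ..., n) and compares the direct maximization of $w_\vartheta \nu_\vartheta(x)$ with the maximization of $w_\vartheta \exp[-t\,D(\alpha\|\vartheta)]$; both must agree because the two scores differ only by the factor $\exp[-t\,H(\alpha)]$, which does not depend on $\vartheta$.

```python
import math

def kl(alpha, theta):
    """Kullback-Leibler divergence D(alpha || theta) with the convention 0 ln 0 = 0."""
    total = 0.0
    for a, th in zip(alpha, theta):
        if a == 0:
            continue
        if th == 0:
            return math.inf
        total += a * math.log(a / th)
    return total

def likelihood(theta, x):
    """nu_theta(x): product of the symbol probabilities along x."""
    p = 1.0
    for symbol in x:
        p *= theta[symbol]
    return p

# Hypothetical finite subclass and prior (illustrative values only).
thetas = [(0.2, 0.3, 0.5), (0.5, 0.25, 0.25), (1 / 3, 1 / 3, 1 / 3)]
weights = [0.5, 0.25, 0.25]

x = [0, 0, 1, 0, 2, 0]                      # observed symbols
t = len(x)
alpha = [x.count(i) / t for i in range(3)]  # empirical frequencies

direct = max(range(3), key=lambda k: weights[k] * likelihood(thetas[k], x))
via_kl = max(range(3), key=lambda k: weights[k] * math.exp(-t * kl(alpha, thetas[k])))
```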



4 Dynamic MDL 

We can start to develop results. It is surprisingly easy to give a convergence proof
w.p.1 of the non-normalized dynamic MDL predictions based on martingales. However
we omit it, since it does not include a convergence speed assertion as i.m.s.
results do, nor does it yield an off-sequence statement about $\varrho(a|x_{<t})$ for $a \ne x_t$
which is necessary for prediction.

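4.1 Lemma. For an arbitrary class of (semi)measures $\mathcal C$, we have

(i) $\varrho(x) - \sum_{a\in\mathcal X} \varrho(xa) \le \xi(x) - \sum_{a\in\mathcal X} \xi(xa)$ and

(ii) $\varrho^{x}(x) - \sum_{a\in\mathcal X} \varrho^{x}(xa) \le \xi(x) - \sum_{a\in\mathcal X} \xi(xa)$

for all $x \in \mathcal X^*$. In particular, $\xi - \varrho$ is a semimeasure.
Proof. For all $x \in \mathcal X^*$, with $f := \xi - \varrho$ we have

$$\sum_{a\in\mathcal X} f(xa) = \sum_{a\in\mathcal X} \big(\xi(xa) - \varrho(xa)\big) \le \sum_{a\in\mathcal X} \big(\xi(xa) - \varrho^{x}(xa)\big) = \sum_{a\in\mathcal X} \sum_{\nu\in\mathcal C\setminus\{\nu^x\}} w_\nu \nu(xa) \le \sum_{\nu\in\mathcal C\setminus\{\nu^x\}} w_\nu \nu(x) = \xi(x) - \varrho(x) = f(x).$$

The first inequality follows from $\varrho^{x}(xa) \le \varrho(xa)$, and the second one holds since
all $\nu$ are semimeasures. Finally, $f(x) = \xi(x) - \varrho(x) = \sum_{\nu\in\mathcal C\setminus\{\nu^x\}} w_\nu \nu(x) \ge 0$ and
$f(\epsilon) = \xi(\epsilon) - \varrho(\epsilon) \le 1$. Hence $f$ is a semimeasure. □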

4.2 Lemma. Let $\mu$ and $\tilde\mu$ be measures on $\mathcal X$, then

$$\sum_{a\in\mathcal X} \big(\mu(a) - \tilde\mu(a)\big)^2 \le \sum_{a\in\mathcal X} \mu(a) \ln\frac{\mu(a)}{\tilde\mu(a)}.$$

See e.g. [Hut01a, Sec.3.2] for a proof.



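4.3 Theorem. For any class of (semi)measures $\mathcal C$ containing the true distribution
$\mu$ and for all $n$ we have

$$\sum_{t=1}^{n} \mathbf E \sum_{a\in\mathcal X} \big(\mu(a|x_{<t}) - \varrho_{norm}(a|x_{<t})\big)^2 \le w_\mu^{-1} + \ln w_\mu^{-1}.$$

That is, $\varrho_{norm}(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$ (see (6)), which implies $\varrho_{norm}(a|x_{<t}) \to \mu(a|x_{<t})$
with $\mu$-probability one.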



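Proof. From Lemma 4.2, we know

$$\sum_{t=1}^{n} \mathbf E \sum_{a\in\mathcal X} \big(\mu(a|x_{<t}) - \varrho_{norm}(a|x_{<t})\big)^2 \le \sum_{t=1}^{n} \mathbf E \sum_{a\in\mathcal X} \mu(a|x_{<t}) \ln\frac{\mu(a|x_{<t})}{\varrho_{norm}(a|x_{<t})}$$

$$= \sum_{t=1}^{n} \mathbf E \ln\frac{\mu(x_t|x_{<t})}{\varrho_{norm}(x_t|x_{<t})} = \sum_{t=1}^{n} \mathbf E \Big[\ln\frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})} + \ln\frac{\sum_{a\in\mathcal X}\varrho(x_{<t}a)}{\varrho(x_{<t})}\Big]. \qquad (8)$$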



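Then we can estimate

$$\sum_{t=1}^{n} \mathbf E \ln\frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})} = \mathbf E \ln\prod_{t=1}^{n}\frac{\mu(x_t|x_{<t})}{\varrho(x_t|x_{<t})} \le \mathbf E \ln\frac{\mu(x_{1:n})}{\varrho(x_{1:n})} \le \ln w_\mu^{-1}, \qquad (9)$$

where the middle step uses $\varrho(\epsilon) \le 1$ and the last step holds since always
$\frac{\mu}{\varrho} \le w_\mu^{-1}$. Moreover, by setting $x = x_{<t}$, using $\ln u \le u - 1$, adding an
always positive max-term, and finally using $\frac{\mu}{\varrho} \le w_\mu^{-1}$ again, we obtain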



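$$\mathbf E \ln\frac{\sum_{a}\varrho(x_{<t}a)}{\varrho(x_{<t})} \le \sum_{\ell(x)=t-1} \mu(x) \Big[\frac{\sum_{a}\varrho(xa)}{\varrho(x)} - 1\Big]$$

$$\le \sum_{\ell(x)=t-1} \frac{\mu(x)\big(\sum_{a}\varrho(xa) - \varrho(x) + \max\{0, \varrho(x) - \sum_{a}\varrho(xa)\}\big)}{\varrho(x)}$$

$$\le w_\mu^{-1} \sum_{\ell(x)=t-1} \Big[\sum_{a}\varrho(xa) - \varrho(x) + \max\Big\{0, \varrho(x) - \sum_{a}\varrho(xa)\Big\}\Big]. \qquad (10)$$

We proceed by observing

$$\sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[\Big(\sum_{a\in\mathcal X}\varrho(xa)\Big) - \varrho(x)\Big] = \sum_{t=1}^{n} \Big[\sum_{\ell(x)=t}\varrho(x) - \sum_{\ell(x)=t-1}\varrho(x)\Big] = \sum_{\ell(x)=n}\varrho(x) - \varrho(\epsilon) \qquad (11)$$

which is true since for successive $t$ the positive and negative terms cancel. From
Lemma 4.1 we know $\varrho(x) - \sum_{a\in\mathcal X}\varrho(xa) \le \xi(x) - \sum_{a\in\mathcal X}\xi(xa)$ and therefore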

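$$\sum_{t=1}^{n} \sum_{\ell(x)=t-1} \max\Big\{0, \varrho(x) - \sum_{a}\varrho(xa)\Big\} \le \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \max\Big\{0, \xi(x) - \sum_{a}\xi(xa)\Big\}$$

$$= \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[\xi(x) - \sum_{a}\xi(xa)\Big] = \xi(\epsilon) - \sum_{\ell(x)=n}\xi(x). \qquad (12)$$

Here we have again used the fact that positive and negative terms cancel for successive
$t$, and moreover the fact that $\xi$ is a semimeasure. Combining (10), (11) and
(12), and observing $\varrho \le \xi \le 1$, we obtain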



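$$\sum_{t=1}^{n} \mathbf E \ln\frac{\sum_{a}\varrho(x_{<t}a)}{\varrho(x_{<t})} \le w_\mu^{-1}\Big[\xi(\epsilon) - \varrho(\epsilon) + \sum_{\ell(x)=n}\big(\varrho(x) - \xi(x)\big)\Big] \le w_\mu^{-1}. \qquad (13)$$

Therefore, (8), (9) and (13) finally prove the assertion. □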



This is the first convergence result in mean sum, see (6). It implies both on-sequence
and off-sequence convergence. Moreover, it asserts the convergence is "fast"
in the sense that the sum of the total expected deviations is bounded by $w_\mu^{-1} + \ln w_\mu^{-1}$.
Of course, $w_\mu^{-1}$ can be very large, namely 2 to the power of complexity of $\mu$. The
following example will show that this bound is sharp (save for a constant factor).
Observe that in the corresponding result for mixtures, Theorem 2.1, the bound is
much smaller, namely $\ln w_\mu^{-1}$ = complexity of $\mu$.

4.4 Example. Let $\mathcal X = \{0, 1\}$, $N \ge 1$ and $\mathcal C = \{\nu_1, \ldots, \nu_{N-1}, \mu\}$. Each $\nu_i$ is a
deterministic measure concentrated on the sequence $1^{i-1}0^\infty$, while the true distribution
$\mu$ is deterministic and concentrated on $x_{<\infty} = 1^\infty$. Let $w_{\nu_i} = w_\mu = \frac1N$ for all $i$. Then
$\mu$ generates $x_{<\infty}$, and for each $t \le N-1$ we have $\varrho_{norm}(0|x_{<t}) = \varrho_{norm}(1|x_{<t}) = \frac12$.
Hence, $\sum_{t} \mathbf E \sum_{a} \big(\mu(a|x_{<t}) - \varrho_{norm}(a|x_{<t})\big)^2 = \frac12(N-1) \approx \frac12 w_\mu^{-1}$ for large $N$. Here, $\mu$
is Bernoulli, while the $\nu_i$ are not. It might be surprising at a first glance that there
are even classes $\mathcal C$ containing only Bernoulli distributions, where the exponential
bound is sharp [PH04].
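
The cumulative error in this example is easy to reproduce. The following Python sketch builds the class of deterministic measures with uniform weights $1/N$ and sums the squared prediction errors of normalized dynamic MDL along the true sequence $1^\infty$; since $\mu$ is deterministic, the expectation reduces to this single path. The helper names (`point_measure`, `total_error`) are assumptions of this sketch.

```python
from fractions import Fraction

def point_measure(ones):
    """Deterministic measure on 1^ones 0^inf; ones=None stands for 1^inf."""
    def nu(x):
        if ones is None:
            target = "1" * len(x)
        else:
            target = ("1" * ones + "0" * len(x))[:len(x)]
        return Fraction(int(x == target))
    return nu

def total_error(N):
    """Sum over t = 1..N-1 of the squared error of normalized dynamic MDL."""
    models = [point_measure(i - 1) for i in range(1, N)] + [point_measure(None)]
    w = Fraction(1, N)                       # uniform weights, as in the example
    mu = models[-1]                          # true distribution, concentrated on 1^inf

    def rho(x):
        return max(w * nu(x) for nu in models)

    err = Fraction(0)
    for t in range(1, N):
        x = "1" * (t - 1)                    # the only history with positive probability
        denom = sum(rho(x + b) for b in "01")
        for a in "01":
            rho_norm = rho(x + a) / denom    # normalized dynamic MDL prediction
            mu_pred = mu(x + a) / mu(x)      # true prediction (0 or 1)
            err += (mu_pred - rho_norm) ** 2
    return err
```

For example, `total_error(10)` evaluates to 9/2, in line with the value $\frac12(N-1)$ stated in the example.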



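4.5 Theorem. For any class of (semi)measures $\mathcal C$ containing the true distribution
$\mu$, we have

(i) $\displaystyle \sum_{t=1}^{\infty} \mathbf E \Big|\ln \sum_{a\in\mathcal X} \varrho(a|x_{<t})\Big| \le 2 w_\mu^{-1}$ and

(ii) $\displaystyle \sum_{t=1}^{\infty} \mathbf E \sum_{a\in\mathcal X} \big|\varrho_{norm}(a|x_{<t}) - \varrho(a|x_{<t})\big| = \sum_{t=1}^{\infty} \mathbf E \Big|1 - \sum_{a\in\mathcal X} \varrho(a|x_{<t})\Big| \le 2 w_\mu^{-1}$.

Consequently, $\varrho(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$, and for almost all $x_{<\infty} \in \mathcal X^\infty$ the normalizer
$N_\varrho$ defined in (7) converges to a number which is finite and greater than zero, i.e.
$0 < N_\varrho(x_{<\infty}) < \infty$.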



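Proof. (i) Define $u^+ = \max\{0, u\}$ for $u \in \mathbb R$, then for $x := x_{<t} \in \mathcal X^{t-1}$ we have

$$\mathbf E \Big|\ln \sum_{a} \varrho(a|x)\Big| = \mathbf E \Big[\Big(\ln\frac{\sum_{a}\varrho(xa)}{\varrho(x)}\Big)^+ + \Big(\ln\frac{\varrho(x)}{\sum_{a}\varrho(xa)}\Big)^+\Big]$$

$$\le \sum_{\ell(x)=t-1} \mu(x) \Big[\frac{\big(\sum_{a}\varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \frac{\big(\varrho(x) - \sum_{a}\varrho(xa)\big)^+}{\sum_{a}\varrho(xa)}\Big]$$

$$\le w_\mu^{-1} \sum_{\ell(x)=t-1} \Big[\Big(\sum_{a}\varrho(xa) - \varrho(x)\Big)^+ + \Big(\varrho(x) - \sum_{a}\varrho(xa)\Big)^+\Big]$$

$$= w_\mu^{-1} \sum_{\ell(x)=t-1} \Big[\sum_{a}\varrho(xa) - \varrho(x) + 2\Big(\varrho(x) - \sum_{a}\varrho(xa)\Big)^+\Big].$$

Here, $|u| = u^+ + (-u)^+ = -u + 2u^+$, $\ln u \le u - 1$, and $\varrho \ge w_\mu \mu$ have been used, the
latter implies also $\sum_{a}\varrho(xa) \ge w_\mu \sum_{a}\mu(xa) = w_\mu \mu(x)$. The last expression in this
(in)equality chain, when summed over $t = 1 \ldots \infty$, is bounded by $2w_\mu^{-1}$ by essentially
the same arguments (10) - (13) as in the proof of Theorem 4.3.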

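(ii) Let again $x := x_{<t}$ and use $\varrho_{norm}(a|x) = \varrho(a|x) / \sum_{b}\varrho(b|x)$ to obtain

$$\sum_{a} \big|\varrho_{norm}(a|x) - \varrho(a|x)\big| = \sum_{a} \varrho_{norm}(a|x) \Big|1 - \sum_{b}\varrho(b|x)\Big| = \Big|1 - \sum_{b}\varrho(b|x)\Big| = \frac{\big(\sum_{a}\varrho(xa) - \varrho(x)\big)^+}{\varrho(x)} + \frac{\big(\varrho(x) - \sum_{a}\varrho(xa)\big)^+}{\varrho(x)}.$$

Then take the expectation $\mathbf E$ and the sum $\sum_{t=1}^{\infty}$ and proceed as in (i). Finally,
$\varrho(a|x_{<t}) \xrightarrow{\text{i.m.s.}} \mu(a|x_{<t})$ follows by combining (ii) with Theorem 4.3, and by (i),
$\sum_{t=1}^{n} \big|\ln\frac{\sum_{a}\varrho(x_{<t}a)}{\varrho(x_{<t})}\big|$ is bounded in $n$ with $\mu$-probability 1, thus the same is true for
$\ln N_\varrho(x_{<\infty}) = \sum_{t=1}^{\infty} \ln\frac{\sum_{a}\varrho(x_{<t}a)}{\varrho(x_{<t})}$. □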



5 Static MDL 

So far, we have considered dynamic MDL from Definition 3.1. We turn now to the 
static variant (Definition 3.2), which is usually more efficient and thus preferred in 
practice. 
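5.1 Theorem. For any class of (semi)measures $\mathcal C$ containing the true distribution
$\mu$, we have

$$\sum_{t=1}^{\infty} \mathbf E \sum_{a\in\mathcal X} \big|\varrho^{x_{<t}}_{norm}(a|x_{<t}) - \varrho^{x_{<t}}(a|x_{<t})\big| = \sum_{t=1}^{\infty} \mathbf E \Big|1 - \sum_{a\in\mathcal X} \varrho^{x_{<t}}(a|x_{<t})\Big| \le w_\mu^{-1}.$$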



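Proof. We proceed in a similar way as in the proof of Theorem 4.3, (10) - (12).
From Lemma 4.1, we know $\varrho(x) - \sum_{a}\varrho^{x}(xa) \le \xi(x) - \sum_{a}\xi(xa)$. Then

$$\sum_{t=1}^{n} \mathbf E \Big|1 - \sum_{a\in\mathcal X}\varrho^{x_{<t}}(a|x_{<t})\Big| = \sum_{t=1}^{n} \mathbf E\, \frac{\varrho(x_{<t}) - \sum_{a\in\mathcal X}\varrho^{x_{<t}}(x_{<t}a)}{\varrho(x_{<t})}$$

$$= \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \mu(x)\, \frac{\varrho(x) - \sum_{a}\varrho^{x}(xa)}{\varrho(x)} \le w_\mu^{-1} \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[\varrho(x) - \sum_{a}\varrho^{x}(xa)\Big]$$

$$\le w_\mu^{-1} \sum_{t=1}^{n} \sum_{\ell(x)=t-1} \Big[\xi(x) - \sum_{a\in\mathcal X}\xi(xa)\Big] = w_\mu^{-1}\Big[\xi(\epsilon) - \sum_{\ell(x)=n}\xi(x)\Big] \le w_\mu^{-1}$$

for all $n \in \mathbb N$. This implies the assertion. Again we have used $\frac{\mu}{\varrho} \le w_\mu^{-1}$ and the fact
that positive and negative terms cancel for successive $t$. □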



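5.2 Corollary. Let $\mathcal C$ contain the true distribution $\mu$, then

$$\sum_{t=1}^{\infty} \mathbf E \sum_{a} \big(\mu(a|x_{<t}) - \varrho(a|x_{<t})\big)^2 \le [\ldots]$$

$$\sum_{t=1}^{\infty} \mathbf E \sum_{a} \big(\mu(a|x_{<t}) - \varrho^{x_{<t}}(a|x_{<t})\big)^2 \le [\ldots]$$

Proof. This follows by combining the assertions of Theorems 4.3 - 5.1 with
the triangle inequality. For static MDL, use in addition $\sum_{a}|\varrho(a|x) - \varrho^{x}(a|x)| =
|\sum_{a}\varrho(a|x) - \sum_{a}\varrho^{x}(a|x)| \le |\sum_{a}\varrho(a|x) - 1| + |1 - \sum_{a}\varrho^{x}(a|x)|$ which follows from
$\varrho(xa) \ge \varrho^{x}(xa)$. □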

This corollary recapitulates our results and states convergence i.m.s. (and therefore
also with $\mu$-probability 1) for all combinations of un-normalized/normalized and
dynamic/static MDL predictions.²

² We briefly discuss the choice of the total expected square error for measuring speed of
convergence. The expected Kullback-Leibler distance may seem more natural in the light of our proofs.
However, this quantity behaves well only under dynamic MDL, not static MDL. To see this, let $\mathcal C$
be the class of all computable Bernoulli distributions and $\mu$ the measure having $\mu(0) = \mu(1) = \frac12$.
Then the sequence $x = 0^n$ has nonzero probability. For sufficiently large $n$, $\nu^x = \nu_0$ holds (typically
already for small $n$), where $\nu_0$ is the distribution generating only 0. Then $D\big(\mu(\cdot|x)\,\|\,\nu^x(\cdot|x)\big) = \infty$, and the
expectation is $\infty$, too. The quadratic distance behaves locally like the Kullback-Leibler distance
(Lemma 4.2), but otherwise is bounded and thus more convenient.
6 Hybrid MDL and Stabilization



We now turn to the hybrid MDL variant (see Definition 3.3). So far we have not
cared about what happens if two or more (semi)measures obtain the same value
$w_\nu \nu(x)$ for some string $x$. In fact, for the previous results, the tie-breaking strategy
can be completely arbitrary. This need not be so for all thinkable prediction methods
other than static and dynamic MDL, as the following example shows.

6.1 Example. Let $\mathcal X = \mathbb B$ and $\mathcal C$ contain only two measures, the uniform measure
$\lambda$ which is defined by $\lambda(x) = 2^{-\ell(x)}$, and another measure $\nu$ having $\nu(1x) = 2^{-\ell(x)}$
and $\nu(0x) = 0$. The respective weights are $w_\lambda = \frac23$ and $w_\nu = \frac13$. Then, for each
$x$ starting with 1, we have $w_\nu \nu(x) = w_\lambda \lambda(x) = \frac13 2^{-\ell(x)+1}$. Therefore, for all $x_{<\infty}$
starting with 1 (a set which has uniform measure $\frac12$), we have a tie. If the maximizing
element $\nu^*$ is chosen to be $\lambda$ for even $t$ and $\nu$ for odd $t$, then both static and dynamic
MDL constantly predict probabilities of $\frac12$ for all $a \in \mathbb B$. However, the hybrid MDL
predictor values $\frac{\nu^{x_{1:t}}(x_{1:t})}{\nu^{x_{<t}}(x_{<t})}$ oscillate between $\frac14$ and 1.
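
The oscillation can be traced explicitly. The Python sketch below implements the two measures and the alternating tie-breaking rule of the example along the sequence 111...; the weights 2/3 and 1/3 are an assumption of this sketch, chosen so that $w_\lambda \lambda(x) = w_\nu \nu(x)$ holds for every $x$ starting with 1, and all function names are illustrative.

```python
from fractions import Fraction

# Assumed weights; they make the two weighted measures tie on strings starting with 1.
W_LAMBDA, W_NU = Fraction(2, 3), Fraction(1, 3)

def lam(x):
    """Uniform measure: lambda(x) = 2^(-len(x))."""
    return Fraction(1, 2 ** len(x))

def nu(x):
    """nu(1x) = 2^(-len(x)), nu(0x) = 0, and nu(empty string) = 1."""
    if x == "":
        return Fraction(1)
    return Fraction(1, 2 ** (len(x) - 1)) if x[0] == "1" else Fraction(0)

def chosen(x):
    """Tie-breaking of the example: lambda at even length, nu at odd length.

    Only meaningful for strings starting with 1, where both candidates tie.
    """
    return (W_LAMBDA, lam) if len(x) % 2 == 0 else (W_NU, nu)

def dynamic(a, x):
    (w1, m1), (w0, m0) = chosen(x + a), chosen(x)
    return (w1 * m1(x + a)) / (w0 * m0(x))

def static(a, x):
    _, m = chosen(x)
    return m(x + a) / m(x)

def hybrid(a, x):
    (_, m1), (_, m0) = chosen(x + a), chosen(x)
    return m1(x + a) / m0(x)

prefixes = ["1" * t for t in range(1, 7)]
dynamic_vals = [dynamic("1", x) for x in prefixes]
static_vals = [static("1", x) for x in prefixes]
hybrid_vals = [hybrid("1", x) for x in prefixes]
```

Along these prefixes the dynamic and static values are constantly 1/2, while the hybrid values alternate between 1/4 and 1.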

If the ambiguity in the tie-breaking process is removed, e.g. if always the measure
with the larger weight is chosen, then the hybrid MDL predictor does
converge for this example. If there are more (semi)measures in the class and there
remains still a tie of shortest programs, an arbitrary program can be selected, since
then the respective measures are equal, too. In the following, we assume that this
tie-breaking rule is applied.

Do the hybrid MDL predictions always converge then? This is equivalent to 
asking if the process of selecting a maximizing element eventually stabilizes. If there 
is no stabilization, then hybrid MDL will necessarily fail as soon as the weights are 
not equal. A possible counterexample could consist of two measures the fraction of 
which oscillates perpetually around a certain value. This can indeed happen. 

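6.2 Example. Let $\mathcal X$ be binary, $\mu(x) = \prod_{i=1}^{\ell(x)} \mu_i(x_i)$ and $\nu(x) = \prod_{i=1}^{\ell(x)} \nu_i(x_i)$ with

$$\mu_i(1) = 1 - 2^{-2\lceil i/2 \rceil} \quad\text{and}\quad \nu_i(1) = 1 - 2^{-2\lfloor i/2 \rfloor - 1}.$$

Then one can easily see that $\mu(111\ldots) = \prod_{i=1}^{\infty} \mu_i(1) > 0$, $\nu(111\ldots) = \prod_{i=1}^{\infty} \nu_i(1) > 0$,
and $\frac{\nu(1^n)}{\mu(1^n)}$ is convergent but oscillates around its limit. Therefore, we can set
$w_\mu$ and $w_\nu$ appropriately to prevent the maximizing element from stabilizing on
$x_{<\infty} = 111\ldots$ (Moreover, each sequence having positive measure under $\mu$ and $\nu$
contains eventually only ones, and the quotient oscillates.)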

The reason for the oscillation in this example is the fact that measures $\mu$ and $\nu$
are asymptotically very similar. One can also achieve a similar effect by constructing 
a measure which is dependent on the past. This shows in particular that we need 
both parts of the following definition which states properties sufficient for a positive 
result. 
6.3 Definition. (i) A (semi)measure $\nu$ on $\mathcal X^\infty$ is called factorizable if there are
(semi)measures $\nu_i$ on $\mathcal X$ such that $\nu(x) = \prod_{i=1}^{\ell(x)} \nu_i(x_i)$ for all $x \in \mathcal X^*$. That is, the
symbols of sequences $x_{<\infty}$ generated by $\nu$ are independent.

(ii) A factorizable (semi)measure $\mu = \prod \mu_i$ is called uniformly stochastic, if there is
some $\delta > 0$ such that at each time $i$ the probability of all symbols $a \in \mathcal X$ is either
0 or at least $\delta$. That is, $\mu_i(a) > 0 \Rightarrow \mu_i(a) \ge \delta$ for all $a \in \mathcal X$ and $i \ge 1$.

In particular, all deterministic measures are uniformly stochastic. Another sim- 
ple example of a uniformly stochastic measure is a probability distribution which 
generates alternately random bits by fair coin flips and the digits of the binary
representation of $\pi$.

6.4 Theorem. Let $\mathcal C$ be a countable class of factorizable (semi)measures and $\mu \in \mathcal C$ be
uniformly stochastic. Then the maximizing element stabilizes almost surely.

We omit the proof. So in particular, under the conditions of Theorem 6.4, the hy- 
brid MDL predictions converge almost surely. No statement about the convergence 
speed can be made. 

7 Complexities and Randomness 

In this section, we concentrate on universal sequence prediction. It was mentioned 
already in the introduction that this is one interesting application of the theory 
developed so far. So $\mathcal C = \mathcal M$ is the countable set of all enumerable (i.e. lower
semicomputable) semimeasures on $\mathcal X^*$. (Algorithms are identified with semimeasures
rather than measures since they need not terminate.) $\mathcal M$ contains stochastic models
in general, and in particular all models for computable deterministic sequences.
One can show that this class $\mathcal M$ is determined by all algorithms on some fixed universal
monotone Turing machine $U$ [LV97, Th. 4.5.2]. By this correspondence, each
semimeasure $\nu \in \mathcal M$ is assigned a canonical weight $w_\nu = 2^{-K(\nu)}$ (where $K(\nu)$ is the
Kolmogorov complexity of $\nu$, see [LV97, Eq. 4.11]), and $\sum_\nu w_\nu \le 1$ holds. We will
assume programs to be binary, i.e. $p \in \mathbb B^*$, in contrast to outputs, which are strings
$x \in \mathcal X^*$.

The MDL definitions in Section 3 directly transfer to this setup. All our results
(Theorems 4.3 - 5.1) therefore apply to $\varrho = \varrho_{[\mathcal M]}$ if the true distribution $\mu$ is a measure,
which is not very restrictive. Then $\mu$ is necessarily computable. Also, Theorem
2.1 implies Solomonoff's important universal induction theorem: $\xi$ converges to the
true distribution i.m.s., if the latter is computable. Note that the Bayes mixture
$\xi$ is within a multiplicative constant of the Solomonoff-Levin prior $M(x)$, which is
the algorithmic probability that $U$ produces an output starting with $x$ if its input
is random.

In addition to $\mathcal M$, we also consider the set of all recursive measures $\tilde{\mathcal M}$ together
with the same canonical weights, and the mixture $\tilde\xi(x) = \sum_{\nu\in\tilde{\mathcal M}} w_\nu \nu(x)$. Likewise,
define $\tilde\varrho = \varrho_{[\tilde{\mathcal M}]}$. Then we obviously have $\tilde\varrho(x) \le \tilde\xi(x) \le \xi(x)$ and $\tilde\varrho(x) \le \varrho(x) \le \xi(x)$ for
all $x \in \mathcal X^*$. It is even immediate that $\xi(x) \stackrel{\times}{\le} \varrho(x)$ since $\xi \in \mathcal M$. Here, by $f \stackrel{\times}{\le} g$ we
mean $f \le g \cdot O(1)$, "$\stackrel{\times}{\ge}$" and "$\stackrel{\times}{=}$" are defined analogously.

Moreover, for any string $x \in \mathcal X^*$, there is also a universal one-part MDL estimator
$m(x) = 2^{-Km(x)}$ derived from the monotone complexity $Km(x) = \min\{\ell(p) : U(p) = x*\}$.
(I.e. the monotone complexity is the length of the shortest program such that
$U$'s output starts with $x$.) The minimal program $p$ defines a measure $\nu$ with $\nu(x) = 1$
and $w_\nu \ge 2^{-\ell(p) - O(1)}$ (recall that programs are binary). Therefore, $m(x) \stackrel{\times}{\le} \tilde\varrho(x)$
for all $x \in \mathcal X^*$. Together with the following proposition, we thus obtain

$$m(x) \stackrel{\times}{=} \tilde\varrho(x) \le \tilde\xi(x) \stackrel{\times}{\le} \varrho(x) \stackrel{\times}{=} \xi(x) \quad\text{for all } x \in \mathcal X^*. \qquad (14)$$

7.1 Proposition. We have $\tilde\varrho(x) \stackrel{\times}{\le} m(x)$ for all $x \in \mathcal X^*$.

Proof. (Sketch only.) It is not hard to show that given a string $x \in \mathcal X^*$ and a
recursive measure $\nu$ (which in particular may be the MDL descriptor $\nu^*(x)$) it is
possible to specify a program $p$ of length at most $-\log_2 w_\nu - \log_2 \nu(x) + c$ that outputs
a string starting with $x$, where the constant $c$ is independent of $x$ and $\nu$. This is done
via arithmetic encoding. Alternatively, it is also possible to prove the proposition
indirectly using [LV97, Th.4.5.4]. This implies that $m(x) \stackrel{\times}{\ge} w_\nu \nu(x)$ for all $x \in \mathcal X^*$
and all recursive measures $\nu$. Then, also $m(x) \stackrel{\times}{\ge} \max_\nu\{w_\nu \nu(x)\} = \tilde\varrho(x)$ holds. □

On the other hand, we know from [Gac83] that $m \stackrel{\times}{\ne} \xi$. Therefore, at least one
of the two inequalities in (14) must be proper.

7.2 Problem. Which of the inequalities $\tilde\varrho \stackrel{\times}{\le} \tilde\xi$ and $\tilde\xi \stackrel{\times}{\le} \varrho$ is proper (or are both)?
Equation (14) also has an easy consequence in terms of randomness criteria.

7.3 Proposition. A sequence $x_{<\infty} \in \mathcal X^\infty$ is Martin-Löf random with respect to
some computable measure $\mu$ iff for any $f \in \{m, \tilde\varrho, \tilde\xi, \varrho, \xi, M\}$ there is a constant $C > 0$
such that $f(x_{1:n}) \le C\mu(x_{1:n})$ for all $n \in \mathbb N$ holds.

Proof. It is a standard result that if $x_{<\infty}$ is random then $M(x_{1:n}) \le C\mu(x_{1:n})$
for some $C$ [Lev73, Th.3]. Then by (14), $f(x_{1:n}) \stackrel{\times}{\le} \mu(x_{1:n})$ for all $f$. Conversely, if
$f(x_{1:n}) \stackrel{\times}{\le} \mu(x_{1:n})$ for some $f$, then there is $C$ such that $m(x_{1:n}) \le C\mu(x_{1:n})$. This
implies $\mu$-randomness of $x_{<\infty}$ ([Lev73, Th.2] or [LV97, p295]). □

Interestingly, these randomness criteria partly depend on the weights. The criteria
for $\tilde\xi$ and $\tilde\varrho$ are not equivalent any more if weights other than the canonical
weights are used, as the following example will show. In contrast, for $\xi$ and $\varrho$ there
is no weight dependency as long as the weights are strictly greater than zero, since

7.4 Example. There are other randomness criteria than Martin-Löf randomness,
e.g. rec-randomness. A rec-random sequence $x_{<\infty}$ (with respect to the uniform
distribution) satisfies $\nu(x_{1:n}) \le c(\nu)2^{-n}$ for each computable measure $\nu$ and for all $n$.
It is obvious that Martin-Löf random sequences are also rec-random. The converse
does not hold, there are sequences $x_{<\infty}$ that are rec-random but not Martin-Löf
random, as shown e.g. in [Sch71, Wan96].

Let $x_{<\infty}$ be such a sequence, i.e. $\nu(x_{1:n}) \le c(\nu)2^{-n}$ for all computable measures
$\nu$ and for all $n$, but where $x_{<\infty}$ is not Martin-Löf random. Let $\nu_1, \nu_2, \ldots$ be a
(non-effective) enumeration of all computable measures. Define $w_i' = 2^{-i}c(\nu_i)^{-1}$. Then

$$\tilde\xi'(x_{1:n}) = \sum_{i=1}^{\infty} w_i' \nu_i(x_{1:n}) \le \sum_{i=1}^{\infty} 2^{-i} c(\nu_i)^{-1} c(\nu_i) 2^{-n} = 2^{-n} \quad\text{for all } n,$$

i.e. $x_{<\infty}$ is $\tilde\xi'$-random. Thus, $x_{<\infty}$ is also $\tilde\varrho'$-random with $\tilde\varrho' = \max_i\{w_i' \nu_i\}$.

8 Conclusions 

We have proven convergence theorems for MDL prediction for arbitrary countable 
classes of semimeasures, the only requirement being that the true distribution $\mu$
is a measure. Our results hold for both static and dynamic MDL and provide a 
statement about convergence speed in mean sum. This also yields both on-sequence 
and off-sequence assertions. Our results are to our knowledge the strongest available 
for the discrete case. 

Compared to the bound for Solomonoff prediction in Theorem 2.1, the error 
bounds for MDL are exponentially worse, namely $w_\mu^{-1}$ instead of $\ln w_\mu^{-1}$. Our bounds
are sharp in general, as Example 4.4 shows. There are even classes of Bernoulli 
distributions where the exponential bound is sharp [PH04]. 

In the case of continuously parameterized model classes, finite error bounds do 
not hold [BC91, BRY98], but the error grows slowly as $\ln t$. Under additional assumptions
(i.i.d. for instance) and with a reasonable prior, one can prove similar
behavior of MDL and Bayes mixture predictions [Ris96]. In this sense, MDL con- 
verges as fast as Bayes mixture, and this is even true for the "slow" Bernoulli example 
presented in [PH04]. However in Example 4.4, the error grows as t, which shows 
that the Bayes mixture may be superior to MDL in general. 